The European Tertiary Education Register, the reference dataset on European Higher Education Institutions

The European Tertiary Education Register (ETER) is the reference dataset on European Higher Education Institutions (HEIs). ETER provides data on nearly 3,500 HEIs in about 40 European countries, including descriptive information, geographical information, students and graduates (with various breakdowns), revenues and expenditures, personnel, and research activities; as of March 2023, data cover the years from 2011–2020. ETER complies with OECD-UNESCO-EUROSTAT standards for educational statistics; most data are collected from National Statistical Authorities (NSAs) or ministries of participating countries and are subject to extensive checks and harmonization. The development of ETER has been funded by the European Commission and is part of the current efforts to establish a European Higher Education Sector Observatory; it is closely connected to the establishment of a broader data infrastructure in the field of science and innovation studies (RISIS). The ETER dataset is widely used in the scholarly literature on higher education and science policy, as well as for policy reports and analyses.


Background & Summary
Traditionally, data on Higher Education have been provided at the level of whole countries by statistical authorities and international organizations, such as the OECD and UNESCO, to enable comparison of the performance of national systems in terms of educational attainments or research output 1 . To this aim, a set of standards and classifications to achieve international comparability have been developed, such as the UNESCO-OECD-EUROSTAT manual on educational statistics 2 (UOE) and the Frascati manual on research and development 3 . These methodologies have constituted the basis for extensive international comparisons based on a common definition of educational levels or research performing sectors 4,5 .
Yet, national systems are composed of individual Higher Education Institutions (HEIs) that are "as different as chalk and cheese" 6 . While US higher education was always characterized by high institutional diversity 7 , most European systems were historically composed of a rather homogeneous set of (PhD awarding) universities under tight state control 8 and with low autonomy 9 . This changed rapidly from the 1970s onward. On the one hand, higher education systems expanded to include different types of professionally oriented institutions, such as colleges and universities of applied sciences 10 . On the other hand, new governance models inspired by New Public Management 11 emphasized the strategic autonomy of HEIs 12 and, accordingly, the notion of competitive and strategic differentiation 13 . Consequently, individual HEIs and their heterogeneity have become a relevant unit/level of analysis 14 .
While the systematic collection of institutional-level data in the US started already in the 1960s and evolved to an integrated data system in the 1980s already 15 , no similar system existed in Europe until recently. Data at the level of individual HEIs were mostly collected for administrative purposes and not always made available to the general public; national fragmentation also made the comparability of national data problematic 16 . Even EtER usage in scholarly literature and policy reports. In just a few years, the ETER dataset has become widely used in scholarly literature on European higher education; a search in Google Scholar retrieved as of April 2023 more than 300 documents referring to ETER as a data source. While some papers use ETER data directly for their analysis, others either exploit ETER to create a sample for collecting additional data and integrate them with data from ETER to characterize the HEIs studied.
ETER has been used, for example, to analyze the internationalization process of European HEI 21,22 , to study the staff composition of HEIs [23][24][25] and to analyze the mobility of students in Europe 26 .
Further, ETER enabled the first systematic insights into the diversity of HEIs 17,27 and national systems in Europe 28 , paving the way for the first attempts to develop an empirical classification of European HEIs 14 . When combined with similar data from the US and with bibliometric information, ETER aided understanding of the 'performance gap' of European HEIs 29 and enabled demonstrating that the position of HEIs in international rankings is strongly associated with institutional budgets 30 .
Two important areas where ETER data have been used have been the study of the efficiency of HEIs 31 and that of the contribution of higher education to regional economic development 32 and to the formation of spin-off companies 33 .
Beyond these broad studies, ETER has also been used to characterize specific groups of HEIs, such as military schools 34 , technical universities 35 or members of the European Universities initiative 36 , and for specific topics, such as language use 37 and digitalization of HEIs 38 .
Not less important has been the increasing use of ETER for policy analysis and evaluation undertaken by the European Commission and its contractors on topics such as analyzing academic personnel 39 , measuring the travel distance of citizens to HEIs in Europe 40 , studying higher education funding in Europe 41 , or analyzing determinants of students' mobility 42 .
These examples of scholarly work and policy reporting utilizing ETER display the breadth of the topics analyzed and, accordingly, emphasize the function of ETER as a basic data infrastructure, which provides core data on HEIs such as their geographical position, size, and the number of students.

Methods
ETER data collection process involved a Europe-wide collection of secondary data, primarily that processed by national authorities (NSAs or ministries) for the data collection on educational statistics coordinated by the European Statistical Agency EUROSTAT. As discussed below, National Authorities delivered data directly to ETER.
Additionally, the project team has extracted some of the data from institutional Websites and Wikipedia entries of the HEIs included in ETER: this concerns information such as the foundation year, main seat, institutional history.
Since ETER is based on data generated for the data collection on educational statistics, definitions for variables are aligned as much as possible with the UNESCO/OECD/EUROSTAT (UOE) manual on educational statistics 2 . This common set of definitions and standards ensures comparability and consistency across countries and institutions and is codified in a detailed methodological handbook available on Zenodo 43 .
More detailed information on the earlier infrastructure developments and a description of decisions made on design, definition and methodology can be found in 44 . the data collection process. Figure 1 gives an overview of the data collection process and its different phases.
www.nature.com/scientificdata www.nature.com/scientificdata/ Perimeter validation. The first step in the data collection was the definition of the perimeter, i.e., the list of institutions to be included in the data collection. Tertiary education is defined in educational statistics by the level of educational programs (International Standard Classification of Educational Degrees ISCED levels 5 to 8). Such programs include professional education as well as doctorate degrees, resulting in several hundreds or thousands of institutions, even for middle-sized countries. However, many tertiary education activities take place in a large number of relatively small institutions, but a major part of higher education, measured in student or graduate numbers, can be found in a smaller number of larger institutions. To make data collection feasible, the ETER project team used delimitation criteria for the inclusion of institutions: • major activity, i.e., education at ISCED levels 6, 7 and/or 8 is a major activity and part of an institution's mission; • graduation at ISCED levels 6, 7 and/or 8; • national recognition as a higher education institution.
For unclear cases, the following additional criteria were used: (1) size and visibility, i.e., institutions with less than 30 full-time equivalents of academic personnel and less than 200 students could be excluded, except the institution is primarily awarding degrees at ISCED level 8 (leading to a PhD or equivalent award), and (2) continuity, i.e., being a stable organization.
In the perimeter validation process, the ETER project team created perimeter validation files using RISIS-OrgReg as the basis. Basic data include institutional descriptors and geographic information, which are synchronized into ETER on a daily basis. National authorities received a proposed list of institutions from the ETER team, already including demographic events derived from RISIS-OrgReg. National authorities then validated the list by either confirming or adapting the list of institutions and corresponding demographic events. The validated perimeters were used to update RISIS-OrgReg and create the ETER data collection files.
The annual ETER data collection. The ETER team created data collection files in MS Excel, which included prefilled information for descriptors, geographic data and other variables which are not likely to change (e.g., lowest, and highest degree delivered). The data deliverer could update prefilled values at any time. The data collection sheets also included extensive control mechanisms. The data collection sheets were then shared with the data deliverer on a cloud-based collaboration platform. There, the data deliverer could access and edit the files. Data were uploaded to the ETER dataset when data collection files were finalized.
Indicator and monetary value calculation. Indicators as well as some monetary values were calculated during the import process and stored in the ETER database. Monetary values were delivered in the national currency www.nature.com/scientificdata www.nature.com/scientificdata/ and converted into Euros and Purchasing Power Parities (PPP) during data import. Monetary conversion and indicators are the only values in the database generated by the ETER project team.
Addition of data from external sources. ETER includes data on the institutional level, which can be matched with other institutional datasets. After matching, several of these datasets were added to the ETER dataset and updated manually alongside the main data collection on an annual basis or via an API.

Data Records
The ETER dataset currently includes nearly 3,500 HEIs in 41 European countries, i.e., EU-27 member states, European Economic Area countries (Iceland, Liechtenstein, and Norway), Switzerland, the UK, and most of the candidate and potential EU candidate countries (North Macedonia, Serbia, Montenegro, Kosovo, Bosnia and Herzegovina, Albania, and Türkiye). Some data is also available for Andorra and the Holy See. Data are collected annually and, as of spring 2023, cover the period 2011-2020 for most countries (with some missing years particularly in the Balkan countries and in France).
As a rule, ETER should include all HEIs in the countries within the perimeter which deliver at least a bachelor's degree (ISCED level 6), and which exceed a certain size. These restrictions consider the workload for data collection and data availability. The exact national perimeters are agreed upon with data providers (i.e., NSAs or ministries) based on the national status of HEIs. As compared with the number of tertiary education students in EUROSTAT, ETER coverage is 73% at ISCED 45 level 5 (short diplomas), 95% at ISCED6 (bachelor) and 100% at ISCED7 (master). In terms of types of HEIs, the ETER coverage is broader than (PhD awarding) universities, to also include colleges, universities of applied sciences, and specialized institutions such as teacher training institutions, art schools, military schools, etc.
Data completeness by domain, variable and country has progressively improved over the years, and for 2020 is excellent for descriptors and geographical information (nearly 100%) and very good for students and graduates (about 95%, but some missing breakdowns); it is good for personnel data (around 80%, but lower for some breakdowns) and average for financial data (about 60%).
ETER includes several hundreds of variables and indicators, which are complemented by flags and remark fields, for a total of about 800 data fields per institution and year. Additional variables and indicators are added to the data on a rolling basis, following consultation with the ETER governance bodies -the technical support group, methodological working group and advisory board. These extensions reflect the changing needs of data users and the increasing availability of internationally comparable higher education data at the level of individual institutions.
Variables and indicators. The ETER variables and indicators have been grouped into thirteen dimensions, as described in Table 1 below.
For an up-to-date and full description of individual variable definitions, types and data sources, indicators definitions, general rules and methods of computation, as well as detailed information on flags, special codes and remarks, the reader should refer to the latest version of the ETER Handbook 43 .
Flags, special codes and notes. Special codes conforming with guidelines for the UOE data collection are applied throughout the dataset, as-in general-no blank cells are allowed in the data collection except for the 'Notes' fields. The most important codes are 'm' (data missing) and 'a' (variable is not applicable for the unit of data collection).
In addition to special codes, flags are used to identify deviant cases (in terms of format accuracy, consistency, completeness, and comparability). They are accompanied by an explanation in the 'Notes' field providing an explanation of why the value is deviant. Examples of flags are 'b' for break in series due to a change in the data collection procedures, 'de' for a break due to a demographic event, 'd' to inform the user that the definition adopted in the data collection procedures departs from the ETER Handbook 43 .

technical Validation
In order to detect possible inconsistencies among the variables delivered annually, the ETER project team developed an automated multi-level data validation and quality process (see also Fig. 2). It consists of (1) control mechanisms and prefilled data in the data collection files, (2) an automated data validation process after receiving data collection files from the data deliverer, and (3) an internal and external data quality process after the main data collection.
Control mechanisms in the data collection files. Manual checking of data in ETER has become unfeasible due to the extensive amount of data collected. The issues were manifold and included incomplete data, accuracy problems (e.g., wrong formats), or totals not matching the sums of breakdowns. Many of those cases could be spotted automatically by simple dependency rules between variables.
To address these issues, ETER has implemented control mechanisms into the data collection files utilizing conditional formatting. Whenever a value is out of the expected range of possible values, the data deliverer is alarmed by a coloured cell. Colors show data deliverer missing values (yellow), accuracy problems (orange) or issues with sums not matching totals of breakdowns (red). Their purpose is to make data deliverers aware of unusual values and prompt them to either correct or flag them.
The control mechanisms implemented in the data collection files allowed us to significantly reduce the number of inconsistencies in the data. (2023) 10:438 | https://doi.org/10.1038/s41597-023-02353-2 www.nature.com/scientificdata www.nature.com/scientificdata/ Data validation. As a second layer, data are uploaded into the database, and a script, using an API endpoint, runs on the data and produces a report for each country for internal usage. The types of checks are the same as in the data collection file (completeness, accuracy, and consistency), but the scripted data validation is more comprehensive.
The generated report per country and year shows irregularities in the data. It is used internally for examining detected cases, with each detected case is then either corrected or flagged. The data validation process is valuable for the project since it guarantees a clean dataset for ETER users.
Data quality process. The ETER data quality process is a semi-automated process consisting of scripted internal and external quality checks and interaction of the ETER project team with the data deliverer.
External data quality includes a comparison of ETER and Eurostat data. Although ETER data are based on the individual level and aggregation to country totals is not recommended (because of differences in the perimeter), the project team compares aggregated ETER data to Eurostat national data. Two aspects can signify issues in the data: (1) coverage, which should not differ too much, and (2) the development of aggregated ETER data and Eurostat totals, which should be similar. Any large deviations in coverage or the development of data are examined by the project team and, if needed, reported back to NSAs to investigate the reasons.
Multiannual checks analyze longitudinal inconsistencies for a core set of variables, such as the total number of students and the total revenues, which are expected to evolve smoothly. The approach includes two types of control mechanisms: (1) check of discontinuity, which captures multiannual volatility by identifying large variations in the values of variables, and (2) check of variance of deltas, which allows the identification of cases with moderate variations (thus not detected) but anomalous isolated jumps in the data 46 . Detected cases, especially large deviations in one year compared to all other years for a variable and HEI, can be an indication of methodological changes, demographic events, or mistakes. Detected cases are first examined by the project team, as some deviations might be known from previous years or due to demographic events; unexplained cases are reported back to the data deliverer, who are then expected to perform further checks on the data. By experience, some of these cases are explained by simple mistakes in the data collection, such as swapping rows, while others are due to changes in data collection procedures; the former are easily corrected, while the latter are systematically flagged in the database.

Dimension
Description Main data source

Basic Institutional Descriptors
Descriptive information on the HEI, such as legal status, foundation year, legal status, and national label, as well as external identifiers, matching HEIs in ETER with other datasets. www.nature.com/scientificdata www.nature.com/scientificdata/

Usage Notes
A full ETER dataset dump is available on the Zenodo repository in csv format 47 . The file includes the full dataset, a list of variables and codes and the full set of metadata. The current version on Zenodo is updated to the end of April 2023, and will be updated when major changes in the dataset occur.
In addition to the Zenodo dataset, the most up to date version of the data are available at the ETER platform at http://www.eter-project.com, which allows for various options for data search, selection and export indifferent formats.
While most data are publicly available once the national authorities have given publication permission, due to national confidentiality requirements, some ETER data are available only for research purposes on the condition that individual data points are not disclosed publicly. Access to restricted data is possible by registering on the website and accepting the non-disclosure agreement.
Metadata provide essential information on methodological issues, data sources and departures from definitions and, thus, are an important complement to the dataset. The metadata for all selected countries and years can be downloaded in a separate MS-Excel file.
A guidance on how to use the ETER interface for the microdata extraction can be downloaded from the ETER website, together with full documentation and the current version of the ETER Handbook 43 . A dedicated 'learn' section serves as a single-entry point to all training resources for ETER users.
Once downloaded, ETER data can be imported into most currently used software for statistical analyses, including MS Excel, SPSS, Stata, and R. A dedicated section of the ETER website provides guidance on importing the data and dealing with specific issues such as encoding non-numeric codes for missing variables.
Due to the extensive size and complexity of the dataset, which also includes scattered missing variables, with different variables missing by year and country, it is recommended to use MS Excel only for basic descriptive analyses. For more complex tasks, it is strongly advised to use statistical software such as Stata and R. The use of such software facilitates scripting tasks and enables the replication of analyses by downloading the most recent version of the dataset. For cross-sectional analyses, it might be advisable to combine different years in order to maximize availability, while interpolation techniques can be used to fill in data gaps for individual years.
An important feature of ETER is the possibility of combining it with other datasets; this is straightforward for the datasets where ETER/OrgReg identifiers have been introduced, such as the Database of External Quality Assurance Reports (DEQAR; https://www.eqar.eu/qa-results/search/by-institution/), the RISIS publication dataset maintained by the University of Leiden, and the RISIS-EUPRO dataset on participations to European Framework Programs maintained by the Austrian Institute of Technology (https://rcf.risis.io/access-request/datasets). For other datasets, name matching can be performed by combining institutional names (in English and national language), institutional acronyms, country, cities, and websites. To this aim, it is possible to use the OpenRefine open source matching tool (https://openrefine.org/) using the reconciliation API service available on Fig. 2 Overview of the ETER data validation and data quality process.